Database Management System - Indexing

2019-04-21

字数统计: 4.1k字 | 阅读时长≈ 20分钟

Indexing

基本概念

search key/搜索码/搜索键 注意这里的码的定义与主码、候选码以及超码中的码定义不同。这里是泛指。例如如果一个文件上有多个索引，那么它就有多个搜索码。一个文件可以有多个索引，分别基于不同的搜索码。
基本的索引类型:
- 顺序索引。基于值的顺序排序。
- 散列索引。基于将值平均分布到若干散列桶中。一个值所属的散列桶是由散列函数觉定。
顺序索引与散列索引评价metric。
- 访问类型: 范围查询？等值查询？
- 访问时间:在查询中使用该技术找到一个特定数据项或数据项集所需的时间
- 插入时间: 插入一个新数据项所需的时间。该值包括找到插入这个新数据项的正确位置所需的时间，以及更新索引结构所需的时间。
- 删除时间:删除一个数据项所需的时间。该值包括找到待删除项所需的时间，以及更新索引结构所需的时间。
- 空间开销:索引结构所占用的额外存储空间。倘若存储索引结构的额外空间大小适度，通常牺牲一定的空间代价来换取性能的提高是值得的。通常需要在一个文件上建立多个索引。

顺序索引

主索引/聚集索引 primary index / clustered/clustering index：
- 不一定是建立在主码，可以是任何码之上。
- 包含记录的文件与搜索码指定的顺序相同！
辅助索引/次级索引/非聚集索引 secondary index non-clustering index non-clustered index
- 包含记录的文件与搜索码指定的顺序不同！
稠密萦引 (dense index) :
- 稠密聚集索引：文件中的每个搜索码值都有一个索引项。索引项包括搜索码值以及指向具有该搜索码值的第一条数据记录的指针。具有相同搜索码值的其余记录顺序地存储在第一条数据记录之后，记录根据相同的搜索码值排序。
- 稠密非聚集索引中，索引必须存储指向所有具有相同搜索码值的记录的指针列表。
- 相比稀疏索引定位record更快
稀疏索引 (sparse index) : 在稀疏索引中，只为搜索码的某些值建立索引项。
- 只有索引是聚集索引时才能使用稀疏索引。
- 和稠密索引一样，每个索引项也包括一个搜索码值和指向具有该搜索码值的第一条数据记录的指针。为了定位一条记录，我们找到其最大搜索码值小于或等于所查找记录的搜索码值的索引
- require less space and impose less maintainance overhead for insertions/deletions项。然后从该索引项指向的记录开始，沿着文件中的指针查找，直到找到所需记录为止。
选择那种索引？
使用稀疏还是稠密索引是一个access time and space overhead tradeoff.
A good compromise is to have a sparse index with one index entry per block.
Because the dominant cost in processing a database request is the time that it takes to bring a block from disk into main memory. Once we have brought in the block, the time to scan the entire block is negligible. we minimize block accesses while keeping the size of the index (and thus our space overhead) as small as possible.

多级索引

If the relation instead had 100,000,000 tuples, the index would instead occupy 1,000,000 blocks, or 4 gigabytes of space. Such large indices are stored as sequential files on disk.
If an index is small enough to be kept entirely in main memory, the search time to find an entry is low. However, if the index is so large that not all of it can be kept in memory, index blocks must be fetched from disk when required.
The search for an entry in the index then requires several disk-block reads.
The process of searching a large index may be costly.

To locate a record, we first use binary search on the outer index to find the record for the largest search-key value less than or equal to the one that we desire. The pointer points to a block of the inner index. We scan this block until we find the record that has the largest search-key value less than or equal to the one that we desire. The pointer in this record points to the block of the file that contains the record for which we are looking.
In our example, an inner index with 10,000 blocks would require 10,000 entries in the outer index, which would occupy just 100 blocks. If we assume that the outer index is already in main memory, we would read only one index block for a search using a multilevel index, rather than the 14 blocks we read with binary search. As a result, we can perform 14 times as many index searches per second.
Indeed, we can repeat this process as many times as necessary. Indices with two or more levels are called multilevel indices. Searching for records with a multilevel index requires significantly fewer I/O operations than does searching for records by binary search.

索引更新

Regardless of what form of index is used, every index must be updated whenever a record is either inserted into or deleted from the file. Further, in case a record in the file is updated, any index whose search-key attribute is affected by the update must also be updated;
As a result we only need to consider insertion and deletion on an index, and do not need to consider updates explicitly.

Insertion: First, the system performs a lookup using the search-key value that appears in the record to be inserted. The actions the system takes next depend on whether the index is dense or sparse:
- Dense indices:
1. If the search-key value does not appear in the index, the system inserts an index entry with the search-key value in the index at the appropriate position.
2. Otherwise the following actions are taken:
  - a. If the index entry stores pointers to all records with the same search- key value, the system adds a pointer to the new record in the index entry.
  - b. Otherwise, the index entry stores a pointer to only the first record with the search-key value. The system then places the record being inserted after the other records with the same search-key values.
- Sparse indices: We assume that the index stores an entry for each block. If the system creates a new block, it inserts the first search-key value (in search-key order) appearing in the new block into the index. On the other hand, if the new record has the least search-key value in its block, the system updates the index entry pointing to the block; if not, the system makes no change to the index.
Deletion: To delete a record,the system first looks up the record to be deleted. The actions the system takes next depend on whether the index is dense or sparse:
- Dense indices:
1. If the deleted record was the only record with its particular search-key value, then the system deletes the corresponding index entry from the index.
2. Otherwise the following actions are taken:
  - a. If the index entry stores pointers to all records with the same search- key value, the system deletes the pointer to the deleted record from the index entry.
  - b. Otherwise, the index entry stores a pointer to only the first record with the search-key value. In this case, if the deleted record was the first record with the search-key value, the system updates the index entry to point to the next record.
- Sparse indices:
1. If the index does not contain an index entry with the search-key value of the deleted record, nothing needs to be done to the index.
2. Otherwise the system takes the following actions:
  - a. If the deleted record was the only record with its search key, the system replaces the corresponding index record with an index rec- ord for the next search-key value (in search-key order). If the next search-key value already has an index entry, the entry is deleted instead of being replaced.
  - b. Otherwise, if the index entry for the search-key value points to the record being deleted, the system updates the index entry to point to the next record with the same search-key value.

Insertion and deletion algorithms for multilevel indices are a simple extension of the scheme just described. On deletion or insertion, the system updates the lowest-level index as described. As far as the second level is concerned, the lowest-level index is merely a file containing records—thus, if there is any change in the lowest-level index, the system updates the second-level index as described. The same technique applies to further levels of the index, if there are any.

辅助索引

A clustering/primary index may be sparse 中间空过的搜索键值可以通过顺序扫描的方式获取
Secondary/nonclustered indices must be dense 辅助索引若是稀疏的，就不能保证sequential order, 中间相隔的搜索键值没有办法获取
基于候选键的辅助索引就像稠密聚集索引，除了：索引所指的连续的记录值实际上不是顺序储存的
通常情况下辅助索引可以有与聚集索引不同的结构：若聚集索引的搜索键不是候选键，索引只需指向有search key value的第一个记录即可，后续记录可以顺序scan。但对于搜索键不是候选键的非聚集索引，索引只指向第一个record with each search-key value是不够的, 剩余的相同search key value的记录可以存在于文件任何地方，因为文件是以聚集索引的搜索键排序的
所以，辅助索引必须包含所有记录的指针
我们可以使用额外一层间接的方式来实现基于非候选键的辅助索引：辅助索引里的指针指向bucket， bucket指针（们）指向文件记录

索引的自动创建：
- 大多数据库都会自动为主键创建索引（从而用于检测primary key constraint on new tuple insertion）若无主键索引，当插入元组时整个relation都要读一下来确保不违反primary-key constraint.
  我们无法储存既被聚集索引搜索键排序的又被辅助索引搜索键排序的文件！
  因为辅助索引键的顺序和物理键的顺序不同，如果我们按辅助索引键的顺序扫描那么读取一个记录就可能需要从磁盘读一个新的block 这非常慢！

上述索引更新方法插入删除也适用于次级索引。the actions taken are those described for dense indices storing a pointer to every record in the file. 如果一个文件有多个索引，每当文件被修改，所有索引都要更新

辅助索引可以提高基于非主索引key的query的查询效率，但对于数据库改动会造成很大的额外开支。数据库设计者需根据估计的各种queries频率和改动来决定是否使用一些辅助索引。
所以除非根据估计辅助索引可以带来性能提升，不要使用辅助索引：因为辅助索引是nonclustered index whose key order is different from physical-key order in file. 不同的index所指record可能会在不同的磁盘区块上，会导致额外磁盘IO访问！使用不当会效果很糟糕！

B+tree索引与哈希索引

index以block为单位进行index within block用offset

索引类型与基本特点

hash index 通过哈希函数生成hash address to a bucket with possible overflow chain for managing collision cheaper than B+tree if no overflow occurs Access: O(1+#overflow buckets)所以hash索引的最大的特点就是等值查询快，不能进行范围索引。
位图索引适合静态low-cardinality重复数据（属性），can be used as a compressed storage mechanism at the leaf nodes of B±trees for those values that occur very frequently.
B+tree 索引

Motivation

The primary disadvantage of the index-sequential file organization is that performance degrades as the file grows. To overcome this deficiency, we can use a B±tree index.

B+tree索引特点

普通balanced binary tree tall and thin, b tree fat and short because of the node size! b树索引同时支持范围及等值查询 log (N/2) N path length上届disk block access远远优于log2N普通平衡树 for N records in file 可以大量减少磁盘io次数

B tree: m-way(order m, m fanout, m-1info fields) search tree with additional constraints: 叶子层高度相同 root 2 key 其他节点至少半满ceiling(order/2)来尽量减少高度 B-tree indices are similar to B±tree indices. The primary distinction between the two approaches is that a B-tree eliminates the redundant storage of search-key values:A B-tree allows search-key values to appear only once.
B+ tree 更贴近多级索引，是在b树基础上, nonleaf node sparse index 减少disk page access 支持equality search 在叶子层将nonleaf节点key按中序遍历顺序拷贝下来叶子层包含record ptrs 保持中序遍历顺序建立链表形成dense & clustered index 密集聚集索引从而支持range search范围搜索
order=#ptr fields = p /node
#k,v fields = p-1 /node
(p-1)(key_ptr_size + record_ptr_size) + p(block_ptr_size) <= blocksize=512
若想要插入的位置已满 recursively按中序遍历顺序将中点上移同时将前驱后继节点分开始终保持节点半满的要求删除：左合并右合并来满足半满的限制 split if necessary can propagate to root. 重点：split colasce redistribution merge
bottom-up construction for empty B+tree index: 每一层每个节点使用最小值&最小值指针创建下一层entry until root is created
sort and then bulk-loading(insertion)

B+tree索引的缺点

long term performance degradation: b+tree索引或file organization的一个缺点是：相邻叶子节点可能存在于磁盘不同区域，最优情况是节点内指针遵循磁盘内容连续的分布，这样顺序扫描叶子结点基本等价于顺序扫描磁盘. 但随着越来越多的插入删除更新操作, sequentiality is increasingly lost, has to wait for disk seeks increasingly often. 这时候需要重建索引restore sequentiality
次级索引的更新问题: 拆分节点可能需要更新建立的次级索引，一个叶节点可能有成百上千条record 每一条都可能在不同地方 Thus a leaf-node split may require tens or even hundreds of I/O operations to update all affected secondary indices, making it a very expensive operation。解决方法： In secondary indices, in place of pointers to the indexed records, we store the values of the primary- index search-key attributes. For example, suppose we have a primary index on theattribute ID of relation instructor; then a secondary index on dept with each department name a list of instructor’s ID values of the corresponding records, instead of storing pointers to the records. locating a record using the sec- ondary index now requires two steps: First we use the secondary index to find the primary-index search-key values, and then we use the primary index to find the corresponding records. The above approach thus greatly reduces the cost of index update due to file reorganization, although it increases the cost of accessing data using a secondary index.

b+tree file organization

使用b+tree的思想可以创建b+tree file organization 因为不再储存ptr而是file record leaf node size更大需要更好的space utilization，可以通过提高节点容量下限来解决例如2n/3
B±tree file organizations可以用于储存大的对象如 SQL clobs and blobs, 这些东西通常比disk block大甚至好几个gb. 可以通过拆分成很多个小的record序列的方式使用b+tree file organization储存. The records can be sequentially numbered, or numbered by the byte offset of the record within the large object, and the record number can be used as the search key.

Hash Index

can be used for (strictly speaking) secondary indices/file organization,因为
- A hash index is never needed as a clustering index structure, since, if a file itself is organized by hashing, there is no need for a separate hash index structure on it.
- hash file organization provides the same direct access to records that indexing provides, we pretend that a file organized by hashing also has a clustering hash index on it.
open hashing with overflow chain for buckets: add chained bucket when need(conflict) usually used in database
closed hashing: the set of buckets is fixed. Deletion is troublesome: usually used in compiler and assembler because there is only lookup and insert.
- choose other bucket when a bucket is full: Linear probing, quadratic probing
- hash function must be choosed wisely otherwise performance degrades.
static hashing: linear congruential hash function with fixed #hash buckets use overflow chain to manage contention
- problem: performance degrades as database grows in size if no reorganziation of hash structure is done (remap everything & reconstruct corresponding buckets, time-comsuming & massive)
extendible/dynammic hashing:
- copes with changes in database size by splitting and coalescing buckets as the database grows and shrinks. Does so by using extra level of directory that double its size when local depth = global depth during insertion to buckets & bucket overflow (bits are not enough to distinguish the search values of the overflown bucket.). use directory of size 2^k to store ptrs to hash buckets. 扩容numbering使用gray code. Directory numbering last k bits 0 - 2^k
- hash function: check out last k bits / mod 2^k
- global depth = k: Max # of bits needed to tell which bucket an entry belongs to in the directory
- local depth: # of bits used in directory numbering to determine if an entry belongs to this bucket or not
- If a bucket overflow happens, the bucket is split into two. The directory may or may not double, depending on whether the local depth of the overflown bucket was equal to the global depth before split.
- After resize, we do not necessarily have new buckets, match existing directory to current buckets. Meaning usually a bucket is pointed by 2 entries or more. A bucket is only pointed by 1 when bucket overflow due to insertion and this situation is after spliting.
- https://www.youtube.com/watch?v=TtkN2xRAgv4&list=PLkZdHIQy-AeJjLbvcLO-rp1-eImyJvO2l
- http://delab.csd.auth.gr/papers/ExtendibleHashing2017.pdf
- http://www.cs.sfu.ca/CourseCentral/354/lxwu/notes/chapter11.pdf

Multiple key indices

in general a search key can have more than one attribute. A search key containing more than one attribute is referred to as a composite search key. The structure of the index is the same as that of any other index, the only difference being that the search key is not a single attribute, but rather is a list of attributes. The search key can be represented as a tuple of values, of the form (a1, . . . , an), where the indexed attributes are A1, . . . , An. The ordering of search-key values is the lexicographic ordering.
As an example. Consider an index on takes relation on composite search key(courseid,semester,year). This index is useful in finding all students who have registered for a particular course in a particular semester/year. An ordered index on multiple-key can be also used to answer some other queries efficiently.

打赏

版权声明： 本博客所有文章除特别声明外，均采用 Apache License 2.0 许可协议。转载请注明出处！